A Cost Model and Index Architecture for the Similarity Join

نویسندگان

  • Christian Böhm
  • Hans-Peter Kriegel
چکیده

The similarity join is an important database primitive which has been successfully applied to speed up data mining algorithms. In the similarity join, two point sets of a multidimensional vector space are combined such that the result contains all point pairs where the distance does not exceed a parameter ε. Due to its high practical relevance, many similarity join algorithms have been devised. In this paper, we propose an analytical cost model for the similarity join operation based on indexes. Our problem analysis reveals a serious optimization conflict between CPU time and I/O time: Fine-grained index structures are beneficial for the CPU efficiency, but deteriorate the I/O performance. As a consequence of this observation, we propose a new index architecture and join algorithm which allows a separate optimization of CPU time and I/O time. Our solution utilizes large pages which are optimized for I/O processing. The pages accommodate a search structure which minimizes the computational effort. In the experimental evaluation, a substantial improvement over competitive techniques is shown.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Novel Architecture for Detecting Phishing Webpages using Cost-based Feature Selection

Phishing is one of the luring techniques used to exploit personal information. A phishing webpage detection system (PWDS) extracts features to determine whether it is a phishing webpage or not. Selecting appropriate features improves the performance of PWDS. Performance criteria are detection accuracy and system response time. The major time consumed by PWDS arises from feature extraction that ...

متن کامل

A Fast Algorithm for high-dimensional Similarity Joins

Many emerging data mining applications require a similarity join between points in a highdimensional domain. We present a new algorithm that utilizes a new index structure, called the -kdB tree, for fast spatial similarity joins on high-dimensional points. This index structure reduces the number of neighboring leaf nodes that are considered for the join test, as well as the traversal cost of nd...

متن کامل

High Performance Data Mining Using the Nearest Neighbor Join

The similarity join has become an important database primitive to support similarity search and data mining. A similarity join combines two sets of complex objects such that the result contains all pairs of similar objects. Well-known are two types of the similarity join, the distance range join where the user defines a distance threshold for the join, and the closest point query or k-distance ...

متن کامل

Implementation of Low-Cost Architecture for Control an Active Front End Rectifier

In AC-DC power conversion, active front end rectifiers offer several advantages over diode rectifiers such as bidirectional power flow capability, sinusoidal input currents and controllable power factor. A digital finite control set model predictive controller based on fixed-point computations of an active front end rectifier with unity displacement of input voltage and current to improve dynam...

متن کامل

High-Dimensional Similarity Joins

Many emerging data mining applications require a similarity join between points in a high-dimensional domain. We present a new algorithm that utilizes a new index structure, called the -kdB tree, for fast spatial similarity joins on high-dimensional points. This index structure reduces the number of neighboring leaf nodes that are considered for the join test, as well as the traversal cost of f...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2001